# MP4 Presentation

RISC-y\_Business3

Alex Vetsavong • Peter Kircher • Mohan Li

### Design Features

#### Cache hierarchy:

- 16-set, direct-mapped split L1 caches (instruction and data) with register-based arrays (512 B each)
- 32-set, 4-way set-associative unified L2 cache with BRAM arrays (32 Kb)
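The cache parameters above can be sanity-checked with a small sketch. The 32-byte line size and 32-bit address width are assumptions, not stated on the slide: 32 bytes is what 512 B across 16 direct-mapped sets implies, and with the same line size the L2 holds 32 × 4 × 32 B = 4 KiB, which is 32 Kb when counted in bits. The `CacheGeometry` class is a hypothetical helper for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CacheGeometry:
    sets: int
    ways: int
    line_bytes: int = 32   # assumption: inferred from 512 B / 16 sets
    addr_bits: int = 32    # assumption: 32-bit physical addresses

    @property
    def data_bytes(self):
        # Total data capacity, excluding tag/valid/dirty/LRU storage.
        return self.sets * self.ways * self.line_bytes

    @property
    def offset_bits(self):
        # line_bytes is a power of two, so bit_length() - 1 is log2.
        return self.line_bytes.bit_length() - 1

    @property
    def index_bits(self):
        return self.sets.bit_length() - 1

    @property
    def tag_bits(self):
        return self.addr_bits - self.index_bits - self.offset_bits

    def split(self, addr):
        """Split an address into (tag, set index, byte offset)."""
        offset = addr & (self.line_bytes - 1)
        index = (addr >> self.offset_bits) & (self.sets - 1)
        tag = addr >> (self.offset_bits + self.index_bits)
        return tag, index, offset


L1 = CacheGeometry(sets=16, ways=1)
L2 = CacheGeometry(sets=32, ways=4)

print("L1 data bytes:", L1.data_bytes)        # 512
print("L2 data bits: ", L2.data_bytes * 8)    # 32768
print("L1 fields for 0x1234:", L1.split(0x1234))
```

Under these assumptions an L1 lookup uses 5 offset bits, 4 index bits, and a 23-bit tag, while the L2 uses 5 index bits and a 22-bit tag.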

#### M-extension:

- 32-bit × 32-bit add-shift multiplier
- 32-bit divider
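As a behavioral illustration of these two units, the sketch below models an add-shift multiplier and a bit-serial divider in Python. It is a software model, not the team's RTL. The slide does not say which division algorithm was used, so the restoring scheme here is an assumption, as are the function names. Division by zero follows the RISC-V M-extension rule for `DIVU`/`REMU` (all-ones quotient, remainder equal to the dividend).

```python
MASK32 = 0xFFFF_FFFF


def shift_add_mul(a, b):
    """Unsigned 32x32 -> 64-bit multiply, examining one multiplier bit per step."""
    a &= MASK32
    b &= MASK32
    product = 0
    for i in range(32):
        if (b >> i) & 1:
            # Add the multiplicand, shifted to the weight of bit i.
            product += a << i
    return product


def restoring_div(n, d):
    """Unsigned 32-bit restoring division, one quotient bit per step.

    Assumption: the slides only say "32-bit divider"; restoring division
    is shown because it pairs naturally with an add-shift multiplier.
    """
    n &= MASK32
    d &= MASK32
    if d == 0:
        # RISC-V M extension: DIVU by zero yields all ones, REMU yields n.
        return MASK32, n
    quotient = 0
    remainder = 0
    for i in range(31, -1, -1):
        # Bring down the next dividend bit, most significant first.
        remainder = (remainder << 1) | ((n >> i) & 1)
        quotient <<= 1
        if remainder >= d:
            remainder -= d
            quotient |= 1
    return quotient, remainder


print(hex(shift_add_mul(0xFFFF_FFFF, 0xFFFF_FFFF)))
print(restoring_div(100, 7))
```

Both loops run a fixed 32 iterations, which mirrors the multi-cycle latency of a simple iterative hardware unit.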

### Performance Metrics (bad)

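- Fmax suffered from single-cycle reads on the multi-way, BRAM-backed L1.
- Minimal performance increase: going from a 2-way L2 to 4-way L1 and L2 cut total cycles by only about 3.5% (88,647 to 85,545).
- The L2 did the heavy lifting: adding it cut total cycles by about 53% (188,214 to 88,647) and main memory accesses from 2,324 to 343.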

| Parameter            | Baseline | 2-way L2 | 4-way L1, L2 |
|----------------------|----------|----------|--------------|
| Total cycles         | 188214   | 88647    | 85545        |
| Fmax (MHz)           | 52.1     | 47.2     | 44.8         |
| Main memory accesses | 2324     | 343      | 332          |
| Memory stall cycles  | 125091   | 25524    | 22422        |
| L1 I-Cache hit rate  | 99.9%    | 99.9%    | 99.9%        |
| L1 D-Cache hit rate  | 78.8%    | 79.3%    | 85.7%        |
| L2 hit rate          | N/A      | 85.2%    | 79.6%        |
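Cycle counts and Fmax pull in opposite directions in this table, so it helps to combine them into wall-clock time. The sketch below does that with the table's own numbers, assuming each design is clocked at its reported Fmax and that all three columns ran the same program. Both assumptions are implied by the table but not stated outright.

```python
# (total cycles, Fmax in MHz), copied from the table above.
CONFIGS = {
    "Baseline": (188_214, 52.1),
    "2-way L2": (88_647, 47.2),
    "4-way L1, L2": (85_545, 44.8),
}


def runtime_us(cycles, fmax_mhz):
    """Execution time in microseconds: cycles / (cycles per microsecond)."""
    return cycles / fmax_mhz


RUNTIMES = {name: runtime_us(c, f) for name, (c, f) in CONFIGS.items()}

for name, t in RUNTIMES.items():
    print(f"{name:<13} {t:8.1f} us")
```

Under these assumptions the 4-way configuration finishes slightly later than the 2-way one despite needing fewer cycles, because its lower Fmax outweighs the cycle savings.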

### Performance Metrics (good)

- Fmax is substantially better
  - ~58.65% to 85.89% increase over the previous design

| comp1.s | Metrics        | Baseline  | w/ 4-way L2 | w/Multiplier | Both features |
|---------|----------------|-----------|-------------|--------------|---------------|
|         | Time (ns)      | 1,200,254 | 773,259     | 1,200,254    | 773,259       |
|         | Power (mW)     | 517.79    | 593.81      | 517.79       | 593.81        |
|         | BRAM size (Kb) | 0.736     | 33.504      | 0.736        | 33.504        |
|         | I-misses       | 1044      | 1044        | 1044         | 1044          |
|         | I-serves       | 57731     | 57750       | 57731        | 57750         |
|         | D-misses       | 7         | 7           | 7            | 7             |
|         | D-serves       | 2945      | 2932        | 2945         | 2932          |
|         | L2 Misses      | N/A       | 32          | N/A          | 32            |
|         | L2 Serves      | N/A       | 1051        | N/A          | 1051          |
|         | Fmax (MHz)     | 83.28     | 81.23       | 83.28        | 81.23         |

| comp3.s | Metrics        | Baseline  | w/ 4-way L2 | w/Multiplier | Both features |
|---------|----------------|-----------|-------------|--------------|---------------|
|         | Time (ns)      | 3,632,258 | 1,163,707   | 3,632,258    | 1,163,707     |
|         | Power (mW)     | 466.52    | 538.07      | 466.52       | 538.07        |
|         | BRAM size (Kb) | 0.736     | 33.504      | 0.736        | 33.504        |
|         | I-misses       | 5905      | 5905        | 5905         | 5905          |
|         | I-serves       | 83983     | 69701       | 83983        | 69701         |
|         | D-misses       | 502       | 502         | 502          | 502           |
|         | D-serves       | 41724     | 18354       | 41724        | 18354         |
|         | L2 Misses      | N/A       | 316         | N/A          | 316           |
|         | L2 Serves      | N/A       | 6407        | N/A          | 6407          |
|         | Fmax (MHz)     | 83.28     | 81.23       | 83.28        | 81.23         |

| comp2.s | Metrics        | Baseline  | w/ 4-way L2 | w/Multiplier | Both features |
|---------|----------------|-----------|-------------|--------------|---------------|
|         | Time (ns)      | 3,821,912 | 1,911,601   | 1,602,546    | 1,619,941     |
|         | Power (mW)     | 488.5     | 600.13      | 463.36       | 534.04        |
|         | BRAM size (Kb) | 0.736     | 33.504      | 0.736        | 33.504        |
|         | I-misses       | 4685      | 4685        | 22           | 22            |
|         | I-serves       | 141904    | 135050      | 132450       | 130664        |
|         | D-misses       | 146       | 146         | 66           | 66            |
|         | D-serves       | 12673     | 5216        | 3292         | 3278          |
|         | L2 Misses      | N/A       | 65          | N/A          | 45            |
|         | L2 Serves      | N/A       | 4831        | N/A          | 88            |
|         | Fmax (MHz)     | 83.28     | 81.23       | 83.28        | 81.23         |

### Takeaways from Metrics

- The L2 cache gives good gains across the board
  - Execution time drops by 35.58% (comp1.s) to 67.96% (comp3.s)

| Program | Metric    | Baseline  | w/ 4-way L2 |
|---------|-----------|-----------|-------------|
| comp1.s | Time (ns) | 1,200,254 | 773,259     |
| comp3.s | Time (ns) | 3,632,258 | 1,163,707   |
| comp2.s | Time (ns) | 3,821,912 | 1,911,601   |
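The quoted range can be reproduced directly from the execution times in the table. This is a minimal sketch; `reduction_pct` is a hypothetical helper name.

```python
# (baseline time, time with 4-way L2) in ns, copied from the table above.
TIMES_NS = {
    "comp1.s": (1_200_254, 773_259),
    "comp3.s": (3_632_258, 1_163_707),
    "comp2.s": (3_821_912, 1_911_601),
}


def reduction_pct(baseline, improved):
    """Percentage of the baseline execution time that was eliminated."""
    return 100.0 * (baseline - improved) / baseline


REDUCTIONS = {name: reduction_pct(b, t) for name, (b, t) in TIMES_NS.items()}

for name, pct in REDUCTIONS.items():
    print(f"{name}: {pct:.2f}% less execution time")
```

comp1.s and comp3.s give the two ends of the range (35.58% and 67.96%), with comp2.s in between at just under 50%.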

- The L2 cache adds little performance on top of the multiplier in comp2\_m.s
  - comp2\_i.s vs. comp2\_m.s: instruction count difference

| comp2.s | Metrics        | Baseline  | w/ 4-way L2 | w/Multiplier | Both features |
|---------|----------------|-----------|-------------|--------------|---------------|
|         | Time (ns)      | 3,821,912 | 1,911,601   | 1,602,546    | 1,619,941     |
|         | Power (mW)     | 488.5     | 600.13      | 463.36       | 534.04        |
|         | BRAM size (Kb) | 0.736     | 33.504      | 0.736        | 33.504        |
|         | I-misses       | 4685      | 4685        | 22           | 22            |
|         | I-serves       | 141904    | 135050      | 132450       | 130664        |
|         | D-misses       | 146       | 146         | 66           | 66            |
|         | D-serves       | 12673     | 5216        | 3292         | 3278          |
|         | L2 Misses      | N/A       | 65          | N/A          | 45            |
|         | L2 Serves      | N/A       | 4831        | N/A          | 88            |
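One way to see why the combined design does not beat the multiplier alone is to convert the times back into approximate cycle counts. The sketch below uses the Fmax figures reported for comp2.s (83.28 MHz without the L2, 81.23 MHz with it) and assumes the times were measured with the clock at Fmax, which the slides do not confirm. Treat the cycle counts as estimates.

```python
# (execution time in ns, Fmax in MHz) for comp2.s.
COMP2 = {
    "Baseline": (3_821_912, 83.28),
    "w/ 4-way L2": (1_911_601, 81.23),
    "w/Multiplier": (1_602_546, 83.28),
    "Both features": (1_619_941, 81.23),
}


def est_cycles(time_ns, fmax_mhz):
    """Estimated cycle count: ns * MHz = 1e-3 cycles, hence the / 1000.

    Assumption: the design was simulated with its clock period set to 1/Fmax.
    """
    return time_ns * fmax_mhz / 1000.0


CYCLES = {name: est_cycles(t, f) for name, (t, f) in COMP2.items()}

for name, (t, _) in COMP2.items():
    print(f"{name:<14} {t:>9,} ns  ~{CYCLES[name]:>9,.0f} cycles")
```

If the assumption holds, adding the L2 on top of the multiplier saves a small number of cycles but the lower Fmax makes each cycle longer, so the net execution time is about 1% worse than with the multiplier alone.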

### With more time we would...

- Rewrite the forwarding logic to improve Fmax
- Replace the add-shift multiplier with a fast Wallace tree multiplier
- Tune cache parameters for the ideal delay/power tradeoff
- Implement a performant prefetching scheme